In our first plot, linolenic is a continuous variable colored in shades of blue. The channel capacity for hue is only 10 levels, and occlusion among the data points further hampers our understanding of the plot. In the second plot, linolenic is segmented into 4 groups, which gives a quicker understanding of the data by showing the relative values for the groups. Relative judgement is still affected, as color hue comes with the highest error for human observers.
ol = read.csv("olive.csv", header = T, row.names = 1)
head(ol, 10)
## Region Area palmitic palmitoleic stearic oleic linoleic
## 1 1 North-Apulia 1075 75 226 7823 672
## 2 1 North-Apulia 1088 73 224 7709 781
## 3 1 North-Apulia 911 54 246 8113 549
## 4 1 North-Apulia 966 57 240 7952 619
## 5 1 North-Apulia 1051 67 259 7771 672
## 6 1 North-Apulia 911 49 268 7924 678
## 7 1 North-Apulia 922 66 264 7990 618
## 8 1 North-Apulia 1100 61 235 7728 734
## 9 1 North-Apulia 1082 60 239 7745 709
## 10 1 North-Apulia 1037 55 213 7944 633
## linolenic arachidic eicosenoic
## 1 36 60 29
## 2 31 61 29
## 3 31 63 29
## 4 50 78 35
## 5 50 80 46
## 6 51 70 44
## 7 49 56 29
## 8 39 64 35
## 9 46 83 33
## 10 26 52 30
ggplot(ol, aes(palmitic, oleic, col = linolenic)) + geom_point()
ol$disc = cut_interval(ol$linolenic, 4)
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
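The discretization step can be tried in isolation. The sketch below is illustrative only: it applies `cut_interval()` to the ten linolenic values printed by `head(ol, 10)` above and, purely for comparison, `cut_number()`, which the analysis itself does not use.

```r
library(ggplot2)

# The ten linolenic values shown in the head(ol, 10) output
linolenic <- c(36, 31, 31, 50, 50, 51, 49, 39, 46, 26)

# cut_interval(): 4 bins of equal width over the observed range
by_width <- cut_interval(linolenic, 4)

# cut_number(): 4 bins holding roughly equal numbers of observations
by_count <- cut_number(linolenic, 4)

table(by_width)
table(by_count)
```

Equal-width bins keep the group boundaries evenly spaced on the measurement scale, at the cost of uneven group sizes.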
The color plot with linolenic segmented into 4 groups (the second plot above) is the easiest to analyse. The size mapping creates the problem of occlusion due to overlapping points. The orientation angle mapping does not help either, as the many observations in the scatter plot create a high relative judgement error.
a) Color hue: 10 levels of the feature can be perceived and 3.1 bits can be decoded. Color brightness: 5 levels and 2.3 bits can be decoded.
b) Size of object: 4 to 5 levels of the feature can be perceived, depending on the individual abilities of the human subject, and 2.2 bits can be decoded for this aesthetic.
c) Line orientation: 3 bits can be decoded for this feature.
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
ggplot(ol, aes(palmitic, oleic, size = disc)) + geom_point()
# Recode the four groups as the angles 0, pi/4, pi/2 and 3*pi/4 for geom_spoke
levels(ol$disc) <- (0:3) * (pi / 4)
ol$disc <- as.numeric(as.character(ol$disc))
ggplot(ol, aes(palmitic, oleic)) + geom_point() +
  geom_spoke(aes(angle = disc), radius = 40) +
  ggtitle("Scatter plot of palmitic vs oleic discretized\nby linolenic Orientation angle")
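The recoding of groups into angles can also be done without overwriting the factor. The following is a minimal sketch under that assumption; `grp` and its labels are hypothetical stand-ins for the four linolenic bins.

```r
# Hypothetical group labels standing in for the four linolenic bins
grp <- factor(c("low", "mid", "high", "top", "low", "top"),
              levels = c("low", "mid", "high", "top"))

# Level k (1..4) becomes the angle (k - 1) * pi / 4; grp itself is left unchanged
angle <- (as.integer(grp) - 1) * pi / 4

data.frame(grp, angle)
```

Keeping the factor intact means the same column can still be mapped to color or size afterwards.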
In the first plot, Region is treated as numeric and plotted with color brightness. This suggests that the different regions are interrelated, but no such relationship exists, since Region is a categorical variable. Treisman's theory of preattentive processing is showcased in this example: in the second plot we see the same information much more quickly, thanks to preattentive processing of contrast and luminance, because color hue is mapped to the categorical variable Region.
ggplot(ol, aes(oleic, eicosenoic, col = Region)) + geom_point()
ggplot(ol, aes(oleic, eicosenoic, col = cut_interval(Region,3))) + geom_point()
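Why the two plots differ can be checked directly: ggplot2 picks a continuous scale for numeric input and a discrete one for factors. The sketch below uses a hypothetical vector of region codes rather than the olive data, and `factor()` is shown as one possible alternative to `cut_interval()` for a column that is already categorical.

```r
library(ggplot2)

# Hypothetical region codes, stored as numbers like the Region column
region <- c(1, 1, 2, 3, 3)

scale_type(region)          # default scale type for a numeric vector
scale_type(factor(region))  # default scale type for a factor
```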
Each of the 3 colors is combined with 3 shapes and 3 sizes. These feature maps are processed in parallel in our brain, which creates a problem when analysing 27 different types of observation (3 levels on each of 3 mappings). Human channel capacity is limited to 10 levels of hue (3.1 bits), 5 levels of brightness (2.3 bits) and 4 to 5 levels of size (2.2 bits). On average, we are also limited to 6 to 7 levels of different observations (2.6 bits). When several mappings are used at once, the channel capacity does not increase linearly as the sum of the individual channel capacities. With size, brightness and hue used together, the channel capacity is 4.1 bits, whereas the sum of the individual channel capacities is 7.6 bits. Because of this, we cannot interpret the plot easily with preattentive processing.
ggplot(ol, aes(oleic, eicosenoic,
               col = cut_interval(linoleic, 3),
               shape = cut_interval(palmitic, 3),
               size = cut_interval(palmitoleic, 3))) +
  geom_point()
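The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch of the bookkeeping; the bit values are the ones stated in the text, not measured here.

```r
# Three aesthetics with three levels each, as in the plot above
combos <- expand.grid(colour = 1:3, shape = 1:3, size = 1:3)
nrow(combos)      # number of distinct colour/shape/size combinations

# Channel capacities in bits as quoted in the text
bits <- c(hue = 3.1, brightness = 2.3, size = 2.2)
sum(bits)         # naive additive total

# Combined capacity quoted in the text, as a share of the additive total
4.1 / sum(bits)
```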
Size, contrast and shape are individual feature maps linked to different colors, so preattentive processing helps in this case. We can see a clear decision boundary between the different regions. According to Treisman's theory, the human visual system splits different features into separate maps and processes them in parallel. This enables the system to ignore non-target information contained in the master map. This is seen here, as the Region variable has clear decision boundaries that can be observed immediately through the respective color. This makes it easy to identify the clusters based on Region in spite of the many other features in the plot.
ggplot(ol, aes(oleic, eicosenoic,
               col = Region,
               shape = cut_interval(palmitic, 3),
               size = cut_interval(palmitoleic, 3))) +
  geom_point()
The relative judgement error due to area is very high because the plot is drawn as a pie chart: the dominant group, South-Apulia, looks much larger than the other groups.
p <- plot_ly(ol, labels = ~Area, type = 'pie', showlegend = TRUE, textinfo="text", text="") %>%
layout(title = 'Pie Chart Area')
p
It is harder to look for outliers in the contour plot than in the scatter plot, because the extreme values are not plotted in the contour plot. It is also harder to identify clusters in the contour plot than in the scatter plot, which is a big issue with this plot.
The contour plot shows 5 peaks, but you won't be able to spot any corresponding difference in the scatter plot. At some of the peaks shown by the contour plot there are no points in the scatter plot, which can be very misleading. For example, for the contour peak at approximately (900, 12) there are no corresponding points in the scatter plot. So contour plots can sometimes be misleading.
ggplot(ol, aes(linoleic, eicosenoic)) + geom_density2d()
ggplot(ol, aes(linoleic, eicosenoic)) + geom_point()
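One way to check whether a contour peak is backed by actual observations is to draw both layers in the same panel. The sketch below does this on synthetic data (two simulated clumps, not the olive measurements), so the names `toy`, `x` and `y` are placeholders.

```r
library(ggplot2)

# Synthetic data: two well-separated clumps (not the olive measurements)
set.seed(42)
toy <- data.frame(
  x = c(rnorm(100, mean = 0), rnorm(100, mean = 6)),
  y = c(rnorm(100, mean = 0), rnorm(100, mean = 6))
)

# Draw the raw points first, then the estimated density contours on top
p <- ggplot(toy, aes(x, y)) +
  geom_point(alpha = 0.4) +
  geom_density2d()
p
```

With the points visible underneath, any contour that encloses an empty region stands out immediately.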
The columns vary a lot in range. Some values, such as BAvg, are averages and lie between 0.235 and 0.282, while values such as TB lie between 2090 and 2615. This is why scaling is required before we apply non-metric MDS. Scaling brings all variables onto a comparable range, so that no single large-valued column dominates the distances that the NMDS algorithm works from.
## League Won Lost Runs.per.game HR.per.game AB Runs
## Aizona Diamondbacks NL 69 93 4.64 1.172840 5665 752
## Atlanta Braves NL 68 93 4.03 0.757764 5514 649
## Baltimore Orioles AL 89 73 4.59 1.561728 5524 744
## Boston Red Sox AL 93 69 5.42 1.283951 5670 878
## Chicago Cubs NL 103 58 4.99 1.236025 5503 808
## Chicago White Sox AL 78 84 4.23 1.037037 5550 686
## Hits X2B X3B HR RBI StolenB CaughtS BB SO BAvg
## Aizona Diamondbacks 1479 285 56 190 709 137 31 463 1427 0.261
## Atlanta Braves 1404 295 27 122 615 75 34 502 1240 0.255
## Baltimore Orioles 1413 265 6 253 710 19 13 468 1324 0.256
## Boston Red Sox 1598 343 25 208 836 83 24 558 1160 0.282
## Chicago Cubs 1409 293 30 199 767 66 34 656 1339 0.256
## Chicago White Sox 1428 277 33 168 656 77 36 455 1285 0.257
## OBP SLG OPS TB GDP HBP SH SF IBB LOB
## Aizona Diamondbacks 0.320 0.432 0.752 2446 117 50 43 38 43 1113
## Atlanta Braves 0.321 0.384 0.705 2119 145 59 64 52 60 1161
## Baltimore Orioles 0.317 0.443 0.760 2449 119 44 17 36 19 1065
## Boston Red Sox 0.348 0.461 0.810 2615 137 43 8 40 34 1162
## Chicago Cubs 0.343 0.429 0.772 2359 107 96 42 37 45 1217
## Chicago White Sox 0.317 0.410 0.727 2275 122 53 29 44 16 1105
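The effect of the differing ranges on the distances can be sketched with just two columns. The values below are the BAvg and TB entries of the six teams printed above; the short row labels are hypothetical abbreviations.

```r
# BAvg and TB for the six teams shown in the head() output
toy <- data.frame(
  BAvg = c(0.261, 0.255, 0.256, 0.282, 0.256, 0.257),
  TB   = c(2446, 2119, 2449, 2615, 2359, 2275),
  row.names = c("ARI", "ATL", "BAL", "BOS", "CHC", "CWS")  # hypothetical labels
)

d_raw    <- dist(toy)         # Euclidean distances on the raw columns
d_tb     <- dist(toy["TB"])   # distances using TB alone
d_scaled <- dist(scale(toy))  # distances after centering and scaling

# Largest change in any pairwise distance when BAvg is dropped entirely
max(abs(d_raw - d_tb))

# After scaling, each column has standard deviation 1
apply(scale(toy), 2, sd)
```

On the raw data, dropping BAvg barely changes any distance, which is the sense in which TB dominates without scaling.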
It is hard to see a difference between the leagues in this plot. We could say that the National League (NL) teams are spread out away from the origin, while the American League (AL) teams are more concentrated around the origin.
The points are well spread out, so it is hard to tell whether either MDS component differentiates the leagues better. In my opinion, V1 splits the leagues better than V2.
According to this plot “Boston Red Sox” and “Atlanta Braves” look like outliers.
## initial value 19.856833
## iter 5 value 16.319153
## iter 10 value 16.046215
## final value 15.935476
## converged
MDS was able to decrease the stress value to about 15.9%. Given that the dataset had 26 dimensions and was reduced to 2 dimensions, a stress level of 15.9 is good.
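The stress value can also be read from the fitted object instead of the trace. The sketch below runs `isoMDS()` on a small subset only: five columns for the six teams printed earlier, so its stress will not match the value reported for the full table.

```r
library(MASS)

# Five columns for the six teams shown in the head() output
teams <- data.frame(
  Runs = c(752, 649, 744, 878, 808, 686),
  Hits = c(1479, 1404, 1413, 1598, 1409, 1428),
  HR   = c(190, 122, 253, 208, 199, 168),
  BAvg = c(0.261, 0.255, 0.256, 0.282, 0.256, 0.257),
  TB   = c(2446, 2119, 2449, 2615, 2359, 2275)
)

d   <- dist(scale(teams))               # distances on scaled columns
fit <- isoMDS(d, k = 2, trace = FALSE)  # non-metric MDS into 2 dimensions

fit$stress       # final stress, reported in percent
dim(fit$points)  # one row per team, one column per MDS dimension
```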
Some of the observation pairs that were hard for MDS to map were:
“Oakland Athletics and Milwaukee Brewers”, “NY Mets and Minnesota Twins”, “Minnesota Twins and Arizona Diamondbacks”, “Oakland Athletics and Chicago Cubs”, “Pittsburgh Pirates and Chicago Cubs”
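Such pairs can also be ranked programmatically rather than by hovering over the Shepard plot. The sketch below is one possible approach on synthetic data (eight hypothetical objects, not the baseball table): it measures how far each pair lies from the fitted monotone curve and lists the three largest gaps. It relies on `Shepard()` returning its values sorted by increasing dissimilarity.

```r
library(MASS)

# Synthetic 5-dimensional data for 8 hypothetical objects
set.seed(7)
X <- matrix(rnorm(8 * 5), nrow = 8,
            dimnames = list(paste0("obj", 1:8), NULL))

d   <- dist(X)
fit <- isoMDS(d, k = 2, trace = FALSE)
sh  <- Shepard(d, fit$points)  # x, y, yf sorted by increasing dissimilarity

# Row/column index of every pair, in the same order as the dist vector
pairs <- which(lower.tri(matrix(0, 8, 8)), arr.ind = TRUE)
pairs <- pairs[order(as.numeric(d)), ]  # reorder to match Shepard's output

# Gap between each configuration distance and the fitted monotone curve
gap   <- abs(sh$y - sh$yf)
worst <- order(gap, decreasing = TRUE)[1:3]

data.frame(obj1 = rownames(X)[pairs[worst, "row"]],
           obj2 = rownames(X)[pairs[worst, "col"]],
           gap  = gap[worst])
```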
Since V1 split the leagues better, I plotted all the variables against it and found that “RBI” and “OPS” had a strong negative relationship with V1. A Google search showed that these two are important statistics in baseball for differentiating and ranking teams.
RBI (runs batted in): RBI is a statistic in baseball that credits a batter for making a play that allows a run to be scored. The top teams in the league have a high RBI.
OPS (on-base plus slugging): OPS is a statistic calculated as the sum of a player's ability to get on base and to hit with power. Usually the league leaders have the players with the highest OPS. It is an important factor that increases the runs scored by a team.
Both RBI and OPS are important batting statistics in baseball.
plots_bball$P12 #RBI **
plots_bball$P20 #OPS **
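As a complement to inspecting the plots one by one, the variables could be ranked by their correlation with V1. The sketch below uses synthetic data; `v1`, `stats` and the column names `a`, `b`, `c` are all hypothetical and do not correspond to baseball statistics.

```r
# Synthetic stand-ins: v1 plays the role of the first MDS coordinate
set.seed(1)
v1 <- rnorm(30)
stats <- data.frame(
  a = -2 * v1 + rnorm(30, sd = 0.3),  # strongly negatively related to v1
  b = rnorm(30),                      # unrelated to v1
  c = v1 + rnorm(30)                  # moderately positively related to v1
)

# Correlation of every column with v1, strongest relationships first
r <- sapply(stats, function(col) cor(col, v1))
r[order(abs(r), decreasing = TRUE)]
```

A correlation coefficient only captures linear relationships, so the plots remain useful for spotting anything curved.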
library(ggplot2)
library(plotly)
library(xlsx)
library(MASS)
library(gridExtra)
#Assignment 1
#Q1
ol = read.csv("olive.csv", header = T, row.names = 1)
head(ol, 10)
ggplot(ol, aes(palmitic, oleic, col = linolenic)) + geom_point()
disc = cut_interval(ol$linolenic, 4)
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
#Q2
ggplot(ol, aes(palmitic, oleic, col = disc)) + geom_point()
ggplot(ol, aes(palmitic, oleic, size = disc)) + geom_point()
# Map the four linolenic groups to the angles 0, pi/4, pi/2 and 3*pi/4
ggplot(ol, aes(palmitic, oleic)) + geom_point() +
  geom_spoke(aes(angle = (as.integer(disc) - 1) * pi / 4), radius = 40)
#Q3
ggplot(ol, aes(oleic, eicosenoic, col = Region)) + geom_point()
#Q4
ggplot(ol, aes(oleic, eicosenoic,
               col = cut_interval(linoleic, 3),
               shape = cut_interval(palmitic, 3),
               size = cut_interval(palmitoleic, 3))) + geom_point()
#Q5
ggplot(ol, aes(oleic, eicosenoic,
               col = Region,
               shape = cut_interval(palmitic, 3),
               size = cut_interval(palmitoleic, 3))) + geom_point()
#Q6
p <- plot_ly(ol, labels = ~Area, type = 'pie', showlegend = FALSE) %>%
layout(title = 'Pie Chart Area')
p
#Q7
ggplot(ol, aes(linoleic, eicosenoic)) + geom_density2d()
ggplot(ol, aes(linoleic, eicosenoic)) + geom_point()
#Assignment 2
#Q1
bball = read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1", header = TRUE,
row.names = 1)
head(bball)
#Q2
bball.numeric = bball[,3:27]
# Scale the columns first so that large-valued variables do not dominate the distances
distance = dist(scale(bball.numeric))
res = isoMDS(distance, k=2, p=2)
coords = res$points
coordsMDS = as.data.frame(coords)
coordsMDS$name = rownames(coordsMDS)
coordsMDS$league = bball$League
plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter", mode = "markers"
, hovertext=~name, color= ~league)
#Q3
sh <- Shepard(distance, coords)
delta <- as.numeric(distance)   # original dissimilarities
D <- as.numeric(dist(coords))   # distances in the 2-D MDS configuration
n = nrow(coords)
# Row index of each pair in the lower triangle (first object)
index = matrix(1:n, nrow = n, ncol = n)
index1 = as.numeric(index[lower.tri(index)])
# Column index of each pair in the lower triangle (second object)
index = matrix(1:n, nrow = n, ncol = n, byrow = TRUE)
index2 = as.numeric(index[lower.tri(index)])
plot_ly() %>%
  add_markers(x = ~delta, y = ~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(bball)[index1],
                            '<br> Obj 2: ', rownames(bball)[index2])) %>%
  add_lines(x = ~sh$x, y = ~sh$yf)
#Q4
bball$V1 = coordsMDS$V1
bball$V2 = coordsMDS$V2
cols_bball = colnames(bball)
dim(bball)[2]
plots_bball = list()
# One plot per variable (columns 2 to 27) against V1, with a reference line at V1 = 0
for(i in 2:27){
  pl_name = paste("P", i, sep = '')
  col_name = cols_bball[i]
  plots_bball[[pl_name]] = ggplot(bball, aes_string("V1", col_name)) +
    geom_point() + geom_vline(xintercept = 0)
}
grid.arrange(grobs = plots_bball, ncol = 6, nrow = 6)
plots_bball$P2
plots_bball$P3
plots_bball$P4 #Runs per game
plots_bball$P5
plots_bball$P6 #AB
plots_bball$P7 #Runs **
plots_bball$P8 #Hits **
plots_bball$P9
plots_bball$P10
plots_bball$P11
plots_bball$P12 #RBI **
plots_bball$P13
plots_bball$P14
plots_bball$P15
plots_bball$P16
plots_bball$P17 #BAvg
plots_bball$P18
plots_bball$P19 #SLG
plots_bball$P20 #OPS **
plots_bball$P21 #TB **
plots_bball$P22
plots_bball$P23
plots_bball$P24
plots_bball$P25
plots_bball$P26
plots_bball$P27
plots_bball2 = list()
# One plot per variable (columns 2 to 27) against V2, with a reference line at V2 = 0
for(i in 2:27){
  pl_name = paste("P", i, sep = '')
  col_name = cols_bball[i]
  plots_bball2[[pl_name]] = ggplot(bball, aes_string("V2", col_name)) +
    geom_point() + geom_vline(xintercept = 0)
}
grid.arrange(grobs = plots_bball2, ncol = 6, nrow = 6)
plots_bball2$P2
plots_bball2$P3
plots_bball2$P4
plots_bball2$P5
plots_bball2$P6
plots_bball2$P7
plots_bball2$P8
plots_bball2$P9
plots_bball2$P10
plots_bball2$P11
plots_bball2$P12
plots_bball2$P13
plots_bball2$P14
plots_bball2$P15
plots_bball2$P16
plots_bball2$P17
plots_bball2$P18
plots_bball2$P19
plots_bball2$P20
plots_bball2$P21
plots_bball2$P22
plots_bball2$P23
plots_bball2$P24
plots_bball2$P25
plots_bball2$P26
plots_bball2$P27